chore(flushing): standardize code with refactoring on some flushers and retries by duncanista · Pull Request #1018 · DataDog/datadog-lambda-extension

duncanista · 2026-02-04T21:14:35Z

Overview

Simplify the flushing code and standardize it so the logic is no longer scattered across the codebase. Ensure that we create only one client and reuse it as much as possible, for better performance.

Motivation

SVLS-8507

Copilot

Pull request overview

This PR refactors the trace flushing code to standardize and simplify HTTP client management. The main goal is to create a single, reusable HTTP client instance that can be shared across multiple flush operations, improving performance through connection pooling and TLS session reuse.

Changes:

- Removed trait-based abstractions for `TraceFlusher` and `StatsFlusher` in favor of concrete types
- Extracted HTTP client creation logic into a shared `hyper_client` module
- Added lazy initialization of HTTP clients using `OnceCell` for caching and reuse
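
As a rough illustration of the lazy-initialization pattern described above, the sketch below caches a client on first use and hands out the same instance afterwards. It is not the PR's code: `HttpClient`, `build_client`, `Flusher`, and `CLIENTS_BUILT` are hypothetical stand-ins, and it uses the standard library's `std::sync::OnceLock` rather than the `OnceCell` the PR mentions, so that the example stays dependency-free.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Stand-in for the real HTTP client type (hypothetical; the PR uses a hyper-based client).
#[derive(Debug)]
pub struct HttpClient {
    pub id: usize,
}

/// Counts how many clients were built, to make the caching visible.
static CLIENTS_BUILT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical constructor standing in for the shared client-creation logic.
fn build_client() -> HttpClient {
    let id = CLIENTS_BUILT.fetch_add(1, Ordering::SeqCst) + 1;
    HttpClient { id }
}

/// Hypothetical flusher that builds its client lazily and reuses it afterwards.
pub struct Flusher {
    client: OnceLock<Arc<HttpClient>>,
}

impl Flusher {
    pub fn new() -> Self {
        Self { client: OnceLock::new() }
    }

    /// Returns the cached client, building it on the first call only.
    pub fn client(&self) -> Arc<HttpClient> {
        Arc::clone(self.client.get_or_init(|| Arc::new(build_client())))
    }
}
```

Calling `client()` repeatedly on the same `Flusher` returns clones of one `Arc`, so each flush task can take its own handle without constructing a new client.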

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| bottlecap/src/traces/trace_flusher.rs | Removed `TraceFlusher` trait and `ServerlessTraceFlusher` implementation; added cached HTTP client with `OnceCell`; moved HTTP client creation to separate module |
| bottlecap/src/traces/stats_flusher.rs | Removed `StatsFlusher` trait and `ServerlessStatsFlusher` implementation; added cached HTTP client with `OnceCell`; updated to use shared `hyper_client` module |
| bottlecap/src/traces/mod.rs | Added new `hyper_client` module to public exports |
| bottlecap/src/traces/hyper_client.rs | New module containing shared HTTP client creation logic and type definitions |
| bottlecap/src/flushing/service.rs | Removed generic type parameters from `FlushingService` to use concrete flusher types |

Copilot · 2026-02-04T21:15:34Z

+                let client_clone = http_client.clone();
                batch_tasks.spawn(async move {
-                    Self::send(traces_clone, Some(&endpoint), &proxy_https, &tls_cert_file).await
+                    Self::send_traces(traces_clone, Some(endpoint), client_clone).await


Passing `Some(endpoint)` here while passing `None` on line 121 creates an inconsistency. The `endpoint` variable is already an `Endpoint` from the loop, but `send_traces` expects `Option<Endpoint>`. Consider restructuring `send_traces` to accept `&Endpoint` directly and have a separate internal method or branch for the default endpoint case to make the API clearer.

Seems like a pre-existing bug.

To be addressed in another PR.
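
For reference, one possible shape of the restructuring Copilot suggests is sketched below. This is only an illustration and not code from the repository: `Endpoint` here is a simplified stand-in, `TraceFlusherSketch` and `send_traces_default` are hypothetical names, and the methods return a description string instead of performing any I/O.

```rust
/// Simplified stand-in for the real endpoint type (hypothetical field).
#[derive(Debug, Clone, PartialEq)]
pub struct Endpoint {
    pub url: String,
}

/// Hypothetical flusher that owns its default endpoint.
pub struct TraceFlusherSketch {
    default_endpoint: Endpoint,
}

impl TraceFlusherSketch {
    /// Always takes a concrete endpoint, so callers never pass `Some`/`None`.
    /// Returns a description of the send instead of doing network I/O.
    pub fn send_traces(&self, payload: &str, endpoint: &Endpoint) -> String {
        format!("{} -> {}", payload, endpoint.url)
    }

    /// Separate entry point for the default endpoint case.
    pub fn send_traces_default(&self, payload: &str) -> String {
        self.send_traces(payload, &self.default_endpoint)
    }
}
```

With this split, the loop over additional endpoints would call `send_traces(payload, &endpoint)`, and the default path would call `send_traces_default(payload)`, which removes the `Option` from the signature.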

duncanista · 2026-02-05T03:38:16Z

@lym953 I might need a review from you; I'm adding retries on stats :-) among other things.

duncanista · 2026-02-05T03:39:01Z

-}
-
-impl ServerlessTraceFlusher {
-    pub fn get_http_client(


This created a client on every call.

We were creating a client every time we flushed traces; now we use just one. This also removes unnecessary traits, since we are not creating more tracing agents for other use cases.

lym953

Thanks for the refactoring and for adding comments!

lym953 · 2026-02-11T19:04:46Z

Is there a limit on the number of retries?

duncanista · 2026-02-12T20:14:00Z

> Is there a limit on the number of retries?

I think for now it's 2, which is not the standard used in other services.
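
To make the idea of a retry cap concrete, here is a minimal sketch of a bounded retry loop. It is not the extension's implementation: `send_with_retry` and `MAX_RETRIES` are hypothetical names, the loop is synchronous for the sake of a dependency-free example, and it assumes that "2" means two retries after the initial attempt, which the thread does not confirm.

```rust
/// Hypothetical cap; the thread only says the limit is believed to be 2.
pub const MAX_RETRIES: u32 = 2;

/// Runs `op` once, then retries up to `max_retries` more times on error.
/// `op` receives the zero-based attempt number.
/// Returns the first success, or the last error once retries are exhausted.
pub fn send_with_retry<T, E>(
    max_retries: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller.
            Err(err) if attempt >= max_retries => return Err(err),
            // Otherwise try again with the next attempt number.
            Err(_) => attempt += 1,
        }
    }
}
```

Under these assumptions a permanently failing operation is invoked three times in total (one initial attempt plus two retries) before the error is returned.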

## Overview Continuation of #1018 removing unnecessary mut lock on callers for dogstatsd

… Lambda

## Problem

After upgrading from extension v92 to v93, customers reported a sharp increase in "Max retries exceeded, returning request error" errors (SVLS-8672, GitHub issue #1092).

## Root Cause

PR #1018 introduced HTTP client caching for performance improvements. However, the cached client maintains a connection pool that doesn't work well with Lambda's freeze/resume execution model:

1. Lambda executes, HTTP client created with connection pool
2. Extension flushes traces, connections remain open in pool
3. Lambda freezes (paused between invocations - seconds to minutes)
4. Lambda resumes, cached client reuses stale connections
5. TCP errors → "Max retries exceeded"

In v92, a new HTTP client was created per-flush, so there were never stale connections to reuse.

## Solution

Disable connection pooling by setting `pool_max_idle_per_host(0)`. This ensures each request gets a fresh connection, avoiding stale connection issues while still benefiting from client caching.

This matches the pattern used in libdatadog's `new_client_periodic()` which explicitly disables pooling with the comment: "This client does not keep connections because otherwise we would get a pipe closed every second connection because of low keep alive in the agent."

Fixes: SVLS-8672
Fixes: #1092

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

… Lambda (#1094)

## Summary

Fixes a regression introduced in v93 where customers see a sharp increase in "Max retries exceeded, returning request error" errors after upgrading from v92.

- Disables HTTP connection pooling for the trace/stats flusher by setting `pool_max_idle_per_host(0)`
- Prevents stale connections from being reused after Lambda freeze/resume cycles

## Problem

PR #1018 introduced HTTP client caching for performance improvements. However, the cached client maintains a connection pool that doesn't work well with Lambda's freeze/resume execution model:

1. Lambda executes, HTTP client created with connection pool
2. Extension flushes traces, connections remain open in pool
3. Lambda **freezes** (paused between invocations - can be seconds to minutes)
4. Lambda **resumes**, cached client reuses stale connections
5. TCP errors → "Max retries exceeded"

In v92, a new HTTP client was created per-flush, so there were never stale connections to reuse.

## Solution

Disable connection pooling by setting `pool_max_idle_per_host(0)`. This ensures each request gets a fresh connection, avoiding stale connection issues while still benefiting from client caching (TLS session reuse, configuration reuse, etc.).

This matches the pattern used in libdatadog's `new_client_periodic()` which explicitly disables pooling with the comment:

> "This client does not keep connections because otherwise we would get a pipe closed every second connection because of low keep alive in the agent."

## Related

- Fixes [SVLS-8672](https://datadoghq.atlassian.net/browse/SVLS-8672)
- Fixes #1092
- Regression introduced in #1018

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: jordan gonzález <30836115+duncanista@users.noreply.github.com>
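
The freeze/resume failure mode described in the commit message can be illustrated with a toy model. The sketch below is not hyper's connection pool and makes no claim about its internals: `Conn`, `Pool`, `freeze_and_resume`, and `second_flush_is_stale` are all hypothetical, and `max_idle` only loosely mirrors the role of `pool_max_idle_per_host`.

```rust
/// Toy connection; `stale` models a socket the peer closed while Lambda was frozen.
#[derive(Debug)]
pub struct Conn {
    pub id: u32,
    pub stale: bool,
}

/// Toy pool with a cap on idle connections (a simplified model, not hyper's pool).
pub struct Pool {
    max_idle: usize,
    idle: Vec<Conn>,
    next_id: u32,
}

impl Pool {
    pub fn new(max_idle: usize) -> Self {
        Self { max_idle, idle: Vec::new(), next_id: 0 }
    }

    /// Reuses an idle connection if one exists, otherwise opens a fresh one.
    pub fn checkout(&mut self) -> Conn {
        if let Some(conn) = self.idle.pop() {
            return conn;
        }
        self.next_id += 1;
        Conn { id: self.next_id, stale: false }
    }

    /// Hands a connection back; it is kept only if the idle cap allows it.
    pub fn checkin(&mut self, conn: Conn) {
        if self.idle.len() < self.max_idle {
            self.idle.push(conn);
        }
    }

    /// Models a freeze/resume cycle: every idle connection goes stale.
    pub fn freeze_and_resume(&mut self) {
        for conn in &mut self.idle {
            conn.stale = true;
        }
    }
}

/// Flushes once, freezes, then flushes again.
/// Reports whether the second flush was handed a stale connection.
pub fn second_flush_is_stale(max_idle: usize) -> bool {
    let mut pool = Pool::new(max_idle);
    let first = pool.checkout();
    pool.checkin(first);
    pool.freeze_and_resume();
    pool.checkout().stale
}
```

In this model, a cap of `1` lets the second flush pick up the connection that went stale during the freeze, while a cap of `0` means nothing is ever kept idle, so the second flush always opens a fresh connection.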

duncanista requested a review from a team as a code owner February 4, 2026 21:14

duncanista requested review from Copilot and removed request for a team February 4, 2026 21:14

Copilot AI reviewed Feb 4, 2026

duncanista requested a review from lym953 February 5, 2026 03:37

duncanista commented Feb 5, 2026

duncanista mentioned this pull request Feb 5, 2026

chore(deps): upgrade dogstatsd #1020

Merged

Base automatically changed from jordan.gonzalez/flushing/create-service to main February 5, 2026 21:32

duncanista added 6 commits February 5, 2026 16:36

refactor types and methods

2aa9540

We were creating a client every time we flushed traces; now we use just one. This also removes unnecessary traits, since we are not creating more tracing agents for other use cases.

never panic

cd090c7

add comments on additional endpoints to fix later

068c835

add todo comment

12bcb5f

add stats retry

b8b40d2

update docs

7590bd2

duncanista force-pushed the jordan.gonzalez/flushing/standardize-mechanisms branch from 030fd5d to 7590bd2 Compare February 5, 2026 21:36

duncanista changed the title ~~chore(flushing): standardize code with refactoring on trace flushers~~ chore(flushing): standardize code with refactoring on some flushers and retries Feb 5, 2026

lym953 approved these changes Feb 11, 2026

duncanista merged commit b4bb433 into main Feb 12, 2026
45 of 46 checks passed

duncanista deleted the jordan.gonzalez/flushing/standardize-mechanisms branch February 12, 2026 21:02

duncanista added a commit that referenced this pull request Feb 18, 2026

chore(deps): upgrade dogstatsd (#1020)

674cd35

## Overview Continuation of #1018 removing unnecessary mut lock on callers for dogstatsd

duncanpharvey pushed a commit that referenced this pull request Mar 10, 2026

chore(deps): upgrade dogstatsd (#1020)

d2c3cd8

## Overview Continuation of #1018 removing unnecessary mut lock on callers for dogstatsd

jchrostek-dd mentioned this pull request Mar 11, 2026

fix(http): disable connection pooling to prevent stale connections in Lambda #1094

Merged

duncanista merged 6 commits into main from jordan.gonzalez/flushing/standardize-mechanisms
